1. INTRODUCTION

Recent advances in information technology and computer systems have deeply affected our
day-to-day life. One recent application of information technology with great potential is the
interaction between humans and computers. Gesture recognition is a natural communication tool, offering a
powerful means of interaction between humans and computers. Traditional input devices such as keyboards and
mice limit the speed of communication between computer and human. Hand gestures, on the other hand,
can be recognized as the letters of the English alphabet.

Hand gestures are an indispensable means of communication for people who are speech and hearing
impaired. In computers, recognition of continuous gesture patterns is possible by using an artificial neural
network (ANN) [1]-[3]. One advantage of using hand gestures in computers is that visual interpretation
makes human-computer interaction (HCI) easier and more spontaneous for the user [4]-[7]. This study describes an
accurate gesture detection system built on a convolutional neural network (CNN). Possible
applications of this system include computer games, machinery control, and related uses. The proposed work does
not need gloves with special sensors. We use a video graphics array (VGA) camera to capture the hand gestures.
To acquire data, some hand gesture recognition systems need a data glove [8], [9]. In a gesture recognition
system, the motion of a person's fingers or arms is used to convey information. A hand gesture recognition
system interprets the meanings conveyed by gestures. In such systems, the features of gestures are
extracted from images and are used to form feature vectors. These vectors are mapped to the original data set.
Gesture recognition software has been developed for several sign languages, the most common of which is
American Sign Language (ASL) [10], [11]. ANNs are parallel distributed processors built from simple processing
units called neurons. An ANN acquires, stores, and uses knowledge gained through learning or
training, which allows it to handle a wide range of cases. ANNs are widely used in applications such as image
and voice recognition [12], [13].


2. RELATED WORK 

Maung developed a supervised neural network system using a MATLAB toolbox to recognize real-time
gestures of the Myanmar alphabet. The system was designed for speed and did not use complex
hardware. The input images for the system were digitized photographs, from which feature vectors were
generated using histograms. The vectors in turn were fed into the neural network (NN) system. Although
the design was not complex, the time taken for implementation was long because of the MATLAB tool [14].

In another work, Chen et al. presented a new method for hand gesture recognition in which a
background subtraction method was used to separate the hand region from the background. The fingers and
palm were then segmented, and a simple rule-based classifier was used to
recognize the hand gestures. The proposed algorithm yielded an overall accuracy of 96.6% on a
dataset of 1,300 images [15].

A paper presented by Fu et al. described a wavelet-based image preprocessing technique for gesture
recognition. The authors demonstrated a method for feature extraction, which was tested with six different
hand gestures. Their paper described methods for obtaining 1-dimensional signals from 2-dimensional hand
gesture contour images. The system applies wavelet decomposition to the 1-dimensional signals and can
also extract statistical features of the wavelet coefficients. However, the conversion from 2-D to 1-D
affected the accuracy of the neural network, so the method could be applied only to a few hand gestures [16].

Yamato et al. discussed a system that could recognize gestures using three models.
The results obtained from each model are integrated to obtain a composite result. In this system, the audio and
motion models are learned by a hidden Markov model (HMM), whereas a random forest (RF) is used to learn the
video model. The uni-modal and multi-modal models were compared to determine the accuracy of
recognition [17].

Bobic et al. proposed a method for recognizing hand gestures using neural networks. The authors
captured images against multiple backgrounds and in multiple spatial orientations. A histogram of oriented
gradients was used for feature extraction and the backpropagation algorithm for training. In another method,
the authors implemented a sparse autoencoder. In this method, more gestures were used for training and fewer for
recognition. Another limitation was that the authors used only static hand gestures in their study [18].

Badi et al. proposed a method in which images are preprocessed and classified using an ANN. During
the preprocessing stage, edge detection, homogeneity, and other filtering operations were performed.
Then, using complex hand contour and Ahzat methods, the lines are extracted from the hand gestures [19].


3. PROPOSED SYSTEM 

The hand gesture recognition experiment was conducted using a web camera and required a white
background with sufficient illumination. A particular gesture was shown in front of the camera, and the action was
identified by the trained model. This process had both a training and a testing phase. Figure 1 shows the
system diagram of the proposed work. Different hand gestures were captured and provided to the system as
input. The proposed method used a desktop system and a web camera. To provide a gesture, the user had to
show a hand in front of the camera. A red, green, blue (RGB) image was extracted from each video frame. These
images were then converted to the hue, saturation, and value (HSV) color space.
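
As a rough illustration of the color-space step, the sketch below converts RGB pixels to HSV with Python's standard colorsys module. The conversion routine used in the original system is not specified, so the helper names rgb_pixel_to_hsv and frame_to_hsv, the 8-bit channel range, and the hue-in-degrees convention are all assumptions made for this example.

```python
import colorsys


def rgb_pixel_to_hsv(r, g, b):
    """Convert one RGB pixel to (hue in degrees, saturation, value).

    Assumption: channels are integers in 0..255. The value ranges used by
    the original system are not stated in the text.
    """
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 360.0, s, v


def frame_to_hsv(frame):
    """Convert a frame given as rows of (r, g, b) tuples, pixel by pixel."""
    return [[rgb_pixel_to_hsv(*pixel) for pixel in row] for row in frame]


if __name__ == "__main__":
    # A tiny hypothetical 1x2 "frame": one pure red and one pure green pixel.
    print(frame_to_hsv([[(255, 0, 0), (0, 255, 0)]]))
```

A per-pixel loop like this is only practical for small images; a real pipeline would normally use a vectorized image library for the same conversion.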

The major applications of CNNs are image recognition, pattern recognition, speech recognition,
and natural language problems [20]. A convolutional neural network consists of one or more convolution layers,
pooling layers, and fully connected layers. CNNs rely on kernel convolution, a process in which a
small matrix of numbers, called a filter or kernel, is passed over an image and
transforms it. The main objective of the proposed work was to design a system that recognizes hand
gestures for the 26 letters of the English alphabet using a CNN with reasonable accuracy. The values of the
feature map f are calculated according to the following formula, where the input image is denoted by X and our kernel by h:


f[a,b] = (X * h)[a,b] = \sum_{m} \sum_{n} X[a - m, b - n] \, h[m,n]    (1)


where m, n are matrix indices. 
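
The double sum in (1) can be written out directly as nested loops. The sketch below is a minimal pure-Python version. The function name conv2d_valid and the choice to evaluate f only where every index a - m, b - n falls inside the image (the "valid" region) are assumptions for illustration, since the text does not state how image borders are handled.

```python
def conv2d_valid(X, h):
    """Evaluate f[a][b] = sum_m sum_n X[a-m][b-n] * h[m][n].

    X and h are lists of rows. Only positions where all indices a-m and
    b-n lie inside X are computed (assumed border handling), so the result
    has (H - kh + 1) rows and (W - kw + 1) columns.
    """
    H, W = len(X), len(X[0])
    kh, kw = len(h), len(h[0])
    out = []
    for a in range(kh - 1, H):
        row = []
        for b in range(kw - 1, W):
            acc = 0
            for m in range(kh):
                for n in range(kw):
                    acc += X[a - m][b - n] * h[m][n]
            row.append(acc)
        out.append(row)
    return out


if __name__ == "__main__":
    image = [[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]]
    kernel = [[1, 0],
              [0, -1]]
    print(conv2d_valid(image, kernel))
```

Note that (1) is a true convolution, in which the kernel is effectively flipped. Deep learning libraries such as Keras compute cross-correlation (no flip) in their convolution layers, which makes no practical difference when the kernel weights are learned.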
The CNN used in our system has seven layers: three convolutional layers, three max-pooling layers, and
one fully connected layer. Our first convolutional layer consists of 32 filters of size 3x3. We used rectified linear
unit (ReLU) activation. The ReLU function clips negative values to zero and
leaves positive values unchanged. Mathematically, it is represented as


ReLU(x) = \max(0, x)    (2)
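
Equation (2) translates to a one-line function. In the short sketch below, the helper relu_map, which applies the activation to every element of a feature map stored as a list of rows, is a hypothetical name introduced only for this example.

```python
def relu(x):
    """Rectified linear unit: negative inputs become zero."""
    return max(0.0, x)


def relu_map(feature_map):
    """Apply ReLU element-wise to a feature map given as a list of rows."""
    return [[relu(value) for value in row] for row in feature_map]


if __name__ == "__main__":
    print(relu_map([[-1.5, 2.0], [0.5, -0.1]]))
```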


[Figure 1 diagram; recoverable block label: FEATURE EXTRACTION]


Figure 1. System diagram of hand gesture recognition


The next layer is a max pooling layer, which reduces the spatial dimensions
(length and width). It reduces the size of the image by taking the maximum value in each window. In our
system, a max pooling layer of 2x2 was used with a stride of two in both directions. Thirty-two filters of size
3x3 and ReLU activation were used in the second convolutional layer of our system. The pool size of the
second max pooling layer was 2x2. Sixty-four filters of size 3x3 comprise the third convolutional layer.
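
The 2x2 max pooling with a stride of two can be sketched as follows. This is an illustrative pure-Python version rather than the library routine used in the experiments. The function name max_pool2d and the rule that incomplete windows at the right and bottom edges are dropped are assumptions.

```python
def max_pool2d(feature_map, pool=2, stride=2):
    """Slide a pool x pool window over the map and keep each window's maximum.

    Assumption: windows that would extend past the edge are skipped, so a
    5x5 input with pool=2 and stride=2 yields a 2x2 output.
    """
    H, W = len(feature_map), len(feature_map[0])
    out = []
    for i in range(0, H - pool + 1, stride):
        row = []
        for j in range(0, W - pool + 1, stride):
            window = [feature_map[i + di][j + dj]
                      for di in range(pool)
                      for dj in range(pool)]
            row.append(max(window))
        out.append(row)
    return out


if __name__ == "__main__":
    fmap = [[1, 3, 2, 4],
            [5, 6, 7, 8],
            [9, 2, 1, 0],
            [3, 4, 5, 6]]
    print(max_pool2d(fmap))
```

Each output value summarizes four input values, so the length and width of the feature map are halved.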

The third max pooling layer also has a pool size of 2x2. Max pooling layers help to reduce the number of
parameters for large images. Max pooling takes the largest element from each window of the feature map. The fully
connected layer is the final one, in which every neuron of the current layer is connected to every neuron of the
next layer. The fully connected layer is a feed-forward neural network that computes the class
scores. Its input comes from the last pooling layer: the output of the pooling layer is flattened and
applied to the feed-forward neural network. Flattening is the process of unrolling matrix values into a vector. At
every layer, the following calculation takes place:


o = \sum_{i} W_i X_i + b    (3)


where X_i is the input vector, W_i is the weight vector, and b is the bias.
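
Flattening and the weighted sum in (3) can be sketched together. In the example below, flatten and dense are hypothetical helper names, and the weights are stored as one row per output neuron, which is a layout chosen for this illustration and not taken from the original implementation.

```python
def flatten(feature_maps):
    """Unroll a list of 2-D feature maps into a single flat vector."""
    return [value for fmap in feature_maps for row in fmap for value in row]


def dense(x, weights, biases):
    """Compute o_j = sum_i weights[j][i] * x[i] + biases[j] for each neuron j.

    Assumed layout: weights has one row per output neuron, and each row has
    the same length as the input vector x.
    """
    return [sum(w * v for w, v in zip(w_row, x)) + b
            for w_row, b in zip(weights, biases)]


if __name__ == "__main__":
    maps = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
    x = flatten(maps)                    # 8 values from two 2x2 maps
    weights = [[0.1] * 8, [0.0] * 8]     # two output neurons (toy values)
    print(dense(x, weights, [0.0, 1.0]))
```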

This is followed by softmax activation, which is used here for classification.
The output of the softmax layer is a probability distribution, that is, the output values sum to 1. The
output of the softmax function represents a distribution over the class labels: it gives the probability that the
input belongs to each label. The mathematical model of softmax is


\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \quad \text{for } j = 1, \ldots, K    (4)
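
A minimal sketch of the softmax computation follows, together with a hypothetical helper predict_letter that picks the most probable class. The mapping of class index 0 to "A", 1 to "B", and so on is an assumption for illustration, as is the subtraction of the maximum score, a standard numerical safeguard that leaves the result of (4) unchanged.

```python
import math


def softmax(z):
    """Return exp(z_j) / sum_k exp(z_k) for every score z_j."""
    shift = max(z)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(value - shift) for value in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict_letter(scores):
    """Return (letter, probability) for the highest-scoring class.

    Assumption: class index 0 corresponds to 'A', 1 to 'B', and so on.
    """
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return chr(ord("A") + best), probs[best]


if __name__ == "__main__":
    scores = [0.0] * 26
    scores[2] = 5.0  # a toy score vector that favors the third class
    print(predict_letter(scores))
```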


The gesture recognition problem has 26 classes, so the output layer has 26 neurons.
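
To make the seven-layer stack concrete, the sketch below traces the output shape and parameter count of each layer. Several values are assumptions not given in the text: a 28x28 single-channel input, stride-1 convolutions without padding, and pooling that drops incomplete windows. The function names conv_out, pool_out, and trace_shapes are likewise hypothetical.

```python
def conv_out(size, kernel=3):
    """Output width of a stride-1 convolution without padding (assumed)."""
    return size - kernel + 1


def pool_out(size, pool=2, stride=2):
    """Output width of max pooling; incomplete windows are dropped (assumed)."""
    return (size - pool) // stride + 1


def trace_shapes(input_size=28, channels=1, filters=(32, 32, 64), classes=26):
    """Return (layer name, output shape, parameter count) for each layer.

    The filter counts (32, 32, 64), the 3x3 kernels, the 2x2 pools and the
    26 output classes follow the text; input_size and channels are assumed.
    """
    layers = []
    size, depth = input_size, channels
    for index, count in enumerate(filters, start=1):
        size = conv_out(size)
        params = 3 * 3 * depth * count + count  # weights plus one bias per filter
        layers.append((f"conv{index}", (size, size, count), params))
        size = pool_out(size)
        layers.append((f"pool{index}", (size, size, count), 0))
        depth = count
    flat = size * size * depth
    layers.append(("dense", (classes,), flat * classes + classes))
    return layers


if __name__ == "__main__":
    for name, shape, params in trace_shapes():
        print(name, shape, params)
```

Under these assumed settings the last pooling layer produces a 1x1x64 output, so the fully connected layer receives a 64-element vector. A different input size or padding choice would change these figures.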


4. RESULTS 

We implemented the convolutional neural network using the Keras library and Google TensorFlow [21]-[25].
The gesture recognition system was trained with the sign language data provided in the Modified National
Institute of Standards and Technology (MNIST) dataset. Six rounds of experiments were carried out, using a
different training and testing mode each time, to obtain optimum results. The test results obtained from our
experiments are presented in Table 1 (see Appendix). The training accuracy obtained was 91%.


5. CONCLUSION 

We have proposed and tested a web camera-based approach to hand gesture recognition that uses a
convolutional neural network to recognize different hand gestures. Our model was evaluated using a hand
gesture dataset. The results of our experiments show that the gesture recognition system attains a training
accuracy of 91%.
APPENDIX


Table 1. Test results


Test Case ID   Description                                        Input   Accepted Output   Actual Output

A              This is the 1st letter from ASL in the dataset.
B              This is the 2nd letter from ASL in the dataset.
C              This is the 3rd letter from ASL in the dataset.
D              This is the 4th letter from ASL in the dataset.
F              This is the 5th letter from ASL in the dataset.